Automatic Acquisition of Language Model based on Head-Dependent Relation between Words

Authors

  • Seungmi Lee
  • Key-Sun Choi
Abstract

Language modeling associates a priori probability with a sequence of words, and is a key part of many natural language applications such as speech recognition and statistical machine translation. In this paper, we present a language model based on a kind of simple dependency grammar. The grammar consists of head-dependent relations between words and can be learned automatically from a raw corpus using the reestimation algorithm that is also introduced in this paper. Our experiments show that the proposed model performs better than n-gram models, achieving 11% to 11.5% reductions in test corpus entropy.

1 Introduction

Language modeling associates a priori probability with a sentence. It is a key part of many natural language applications such as speech recognition and statistical machine translation. Previous work on language modeling can be broadly divided into two approaches: one is n-gram-based and the other is grammar-based.

The n-gram model estimates the probability of a sentence as the product of the probability of each word in the sentence. It assumes that the probability of the nth word depends on the previous n-1 words. The n-gram probabilities are estimated by simply counting the n-gram frequencies in a training corpus. In some cases, class (or part-of-speech) n-grams are used instead of word n-grams (Brown et al., 1992; Chang and Chen, 1996). The n-gram model has been widely used so far, but it has always been clear that n-grams cannot represent long-distance dependencies.

In contrast with the n-gram model, the grammar-based approach assigns syntactic structures to a sentence and computes the probability of the sentence using the probabilities of the structures. Long-distance dependencies can be represented well by means of the structures. This approach usually makes use of phrase structure grammars such as probabilistic context-free grammars and recursive transition networks (Lari and Young, 1991; Sneff, 1992; Chen, 1996).
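To make the n-gram counting procedure above concrete, here is a minimal sketch (not from the paper) of maximum-likelihood bigram estimation: probabilities are obtained by counting bigram frequencies in a training corpus, and a sentence's probability is the product of each word's probability given the previous word. The toy corpus and sentence markers are illustrative choices.

```python
from collections import defaultdict

def train_bigram(corpus):
    """corpus: list of token lists. Returns MLE estimates of P(w_n | w_{n-1})."""
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sent in corpus:
        tokens = ["<s>"] + sent + ["</s>"]  # sentence boundary markers
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    # P(cur | prev) = count(prev, cur) / count(prev)
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

def sentence_prob(model, sent):
    """Sentence probability as the product of per-word conditional probabilities."""
    p = 1.0
    tokens = ["<s>"] + sent + ["</s>"]
    for prev, cur in zip(tokens, tokens[1:]):
        # without smoothing, an unseen bigram zeroes out the whole sentence
        p *= model.get((prev, cur), 0.0)
    return p

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = train_bigram(corpus)
print(sentence_prob(model, ["the", "dog", "barks"]))  # → 0.5
```

The zero probability for unseen bigrams illustrates why smoothing is essential in practice, a point the paper returns to when it notes that n-gram smoothing functions carry over to its model.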
In this approach, however, a sentence which is not accepted by the grammar is assigned zero probability. Thus, the grammar must have broad coverage so that any sentence will get non-zero probability. But acquisition of such a robust grammar has been known to be very difficult. Due to this difficulty, some works try to use an integrated model of grammar and n-gram, each compensating for the other (McCandless, 1994; Meteer and Rohlicek, 1993). Given a robust grammar, grammar-based language modeling is expected to be more powerful and more compact in model size than the n-gram-based approach.

In this paper we present a language model based on a kind of simple dependency grammar. The grammar consists of head-dependent relations between words and can be learned automatically from a raw corpus using the reestimation algorithm that is also introduced in this paper. Based on the dependencies, a sentence is analyzed and assigned syntactic structures by which long-distance dependencies are represented. Because the model can be thought of as a linguistic bigram model, the smoothing functions of n-gram models can be applied to it. Thus, the model can be robust, adapt easily to new domains, and be effective.

The paper is organized as follows. We introduce some definitions and notations for the dependency grammar and the reestimation algorithm in section 2, and explain the algorithm in section 3. In section 4, we show the experimental results for the suggested model compared to n-gram models. Finally, section 5 concludes this paper.

2 A Simple Dependency Grammar

In this paper, we assume a kind of simple dependency grammar which describes a language
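The surviving text describes the proposed model as a "linguistic bigram" over head-dependent word pairs, to which n-gram smoothing applies. The following hypothetical sketch illustrates that idea only; the factorization over a given dependency parse and the add-one smoothing are illustrative assumptions, not the authors' reestimation algorithm.

```python
from collections import defaultdict

def train_dep_model(parsed_corpus, vocab_size):
    """parsed_corpus: list of sentences, each a list of (head, dependent) pairs.
    Returns a smoothed estimate of P(dependent | head)."""
    head_count = defaultdict(int)
    pair_count = defaultdict(int)
    for pairs in parsed_corpus:
        for head, dep in pairs:
            head_count[head] += 1
            pair_count[(head, dep)] += 1

    def prob(head, dep):
        # add-one (Laplace) smoothing: unseen head-dependent pairs
        # still receive non-zero probability, keeping the model robust
        return (pair_count[(head, dep)] + 1) / (head_count[head] + vocab_size)

    return prob

def sentence_prob(prob, pairs):
    """Sentence probability factorized over its head-dependent pairs,
    analogous to a bigram product."""
    p = 1.0
    for head, dep in pairs:
        p *= prob(head, dep)
    return p

# toy parse of "the dog barks": root -> barks, barks -> dog, dog -> the
parsed = [[("<root>", "barks"), ("barks", "dog"), ("dog", "the")]]
prob = train_dep_model(parsed, vocab_size=4)
print(sentence_prob(prob, parsed[0]))  # ≈ 0.064 (0.4 ** 3)
```

Because each factor conditions one word on its head rather than on its linear predecessor, long-distance dependencies contribute to the product just as adjacent words do, which is the representational advantage the paper claims over word n-grams.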


Similar Resources

Automatic Construction of Persian ICT WordNet using Princeton WordNet

WordNet is a large lexical database of the English language in which nouns, verbs, adjectives, and adverbs are grouped into sets of cognitive synonyms (synsets). Each synset expresses a distinct concept. Synsets are interlinked by both semantic and lexical relations. WordNet is essentially used for word sense disambiguation, information retrieval, and text translation. In this paper, we propose s...


A new model for persian multi-part words edition based on statistical machine translation

Multi-part words in the English language are hyphenated, and the hyphen is used to separate the different parts. The Persian language contains multi-part words as well. Based on Persian morphology, a half-space character is needed to separate the parts of multi-part words, but in many cases people incorrectly use a space character instead of the half-space character. This common incorrect use of space leads to some s...


The Effect of Social and Cultural Factors on Generation Gap

This study focuses on the effect of social and cultural determinants on the generation gap in Tehranian families in 2011. The purpose of this study is to determine the factors affecting the generation gap in Tehranian families through analytical and empirical patterns, surveyed against related theories and effective factors. The research was prepared with questions including whether there is a relationship betw...


Semantic network impairment in schizophrenic patients: semantic priming with simultaneous presentation of two primes

Abstract Objectives: The present study was designed to investigate the automatic activation of semantic priming in schizophrenic patients. Method: 36 schizophrenic patients and 36 normal subjects participated in two experiments. In experiment one, the effect of semantic relation on identification of degraded targets was examined between a series of single prime words and single target words...


An automatic acquisition method of statistic finite-state automaton for sentences

Statistical language models obtained from a large number of training samples play an important role in speech recognition. In order to obtain higher recognition performance, we should introduce long-distance correlations between words. However, traditional statistical language models such as word n-grams and ergodic HMMs are insufficient for expressing long-distance correlations between words. In t...




Journal title:

Volume   Issue

Pages  -

Publication date: 1998